Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add CallWithTimeout and related. #331

Merged
merged 4 commits into from
Nov 17, 2015

Conversation

stevelinton
Copy link
Contributor

This PR adds a GAP level function

CALL_WITH_TIMEOUT( <seconds>, <useconds>, <fn>, <args> )

<seconds> and <useconds> must be small ints

The return value is

fail -- ran out of time
[] -- returned within the timeout, no result
[ <result> ] -- returned a result within the timeout

The timer is paused during a break loop (and might even by nestable) and reset on quitting from a break.
There are some support functions:

TIMEOUTS_SUPPORTED(); -- returns true if timeouts could be compiled in this system and false otherwise

STOP_TIMEOUT(); -- stops the timer and returns a length 2 list [<seconds>, <useconds>]
RESUME_TIMEOUT( <state> ); -- takes such a list and sets the timer to it and starts it.

these are used in the break loop handling.

This is all intended to be wrapped up at GAP level into something a little bit nicer. There are two implementations behind the scenes, depending on your OS -- one using setitimer and a nicer one using the newer POSIX interface -- timer-create() etc. The second of these could be made to work sensibly in HPC-GAP, I think, but is not available on a Mac.

I'd welcome testing.

@ChrisJefferson
Copy link
Contributor

Some minor issues:

Passing 0,0 as the seconds/microseconds makes the loop go forever.

The following code segfaults gap:

gap> f := function() with true do od; end;;
gap> CALL_WITH_TIMEOUT(10,0,f,[]);
<press ctrl+c>
brk> CALL_WITH_TIMEOUT(10,0,f,[]);
<press ctrl+c again>
brk_2>
<press ctrl+d>
brk> return;
'gap' terminated by signal SIGABRT

@stevelinton
Copy link
Contributor Author

Setting the kernel timer to 0 is how you unset it, which explains the first one. This is probably best dealt with in the GAP wrapper function.

I can also reproduce the crash on OS X. It would be useful to know if anyone sees it on Linux.

Also does any of this work under cygwin at all?

@stevelinton
Copy link
Contributor Author

Chris's crash is messy. I'm not yet 100% sure of the details, but essentially when the user quits from the inner break loop, the longjump buffer used for the timeout is left with its value from the inner timeout, which is invalid since it comes from stack frame that we have left (via the separate longjump used for error handling) so when the outer timeout fires, we try to longjump to a place that no longer exists, and segfault.

The only really nice way to fix this is a full review of the whole error/exception handling concept, and the related business of saving and restoring reader and execution state. Ultimately this would be nice but it will be quite hard and most likely break a few things along the way.

In the meantime, we either add another piece of sticking plaster by trying to find all the right places to save and restore the timeout jump buffer, or we avoid the problem by banning nested timeouts, or something like that.

@stevelinton
Copy link
Contributor Author

Have added sticking plaster. Not quite as complex as I thought.

@stevelinton stevelinton force-pushed the sl/alarm branch 2 times, most recently from b9e9a82 to ff39f83 Compare November 4, 2015 11:14
@stevelinton stevelinton changed the title Add CALL_WITH_TIMEOUT and related functionality (WIP, do not merge) Add CallWithTimeout and related. Nov 4, 2015
@stevelinton
Copy link
Contributor Author

As far as I know, this is now ready for merge, although more testing is always good. It has documentation and a basic testfile.

@ChrisJefferson
Copy link
Contributor

Can make this code crash about 1 time in 3 with the following nasty invocation:

List([1..1000], x -> CALL_WITH_TIMEOUT(0, x, Intersection,[Group((1,2)),Group((2,3))]));

Looks like an incorrect longjmp is occuring (there is no backtrace, just longjmping into nowhere).

@stevelinton
Copy link
Contributor Author

@ChrisJefferson I can't reproduce this.

@ChrisJefferson
Copy link
Contributor

Did you try popping it in a while do .. od? I'll see if I can track down the issue myself (tried make distclean, still seems to pop up). Am on Mac (I assume you are too).

@stevelinton
Copy link
Contributor Author

I can reproduce it now, although rarely. I'll have a look at the code. There might be a race condition around setting up the jump buffer.

@stevelinton stevelinton closed this Nov 4, 2015
@stevelinton stevelinton reopened this Nov 4, 2015
@ChrisJefferson
Copy link
Contributor

The problem is that a we try calling with 'NumAlarmJumpBuffers' = 0, which I think should never happen. This suggests an alarm is getting called after we disabled the alarms (might be something written somewhere that that can happen.

@stevelinton
Copy link
Contributor Author

Think I know what the problem is. If the signal comes in after we execute the last statement of the function, we are left with InterruptExecStats having been called and SyAlarmHasGoneOff set to true,
but nowhere to longjump to.

I can fix.

@ChrisJefferson
Copy link
Contributor

Not sure what is going on with travis's timing -- seems to be going crazy but not sure how it's going crazy.

@stevelinton
Copy link
Contributor Author

OK. I think I've fixed the race condition. Travis timings probably relate to which timers see the system CPU time associated with calls to RunTime(). Possibly need to rethink how to test this code.

@ChrisJefferson
Copy link
Contributor

I tried injecting some 'Runtimes()' calls into the test to see where the time is being spent, it looks like there isn't any great amount of time being spent anywhere:

########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou\
t.tst, line 18:
# Input is:
Runtimes();
# Expected output:
# But found:
rec( system_time := 299, system_time_children := 0, user_time := 2523, 
  user_time_children := 0 )
########
########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou\
t.tst, line 19:
# Input is:
CallWithTimeout(5000,spinFor,1,1);
# Expected output:
[ 1 ]
# But found:
fail
########
########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou\
t.tst, line 21:
# Input is:
Runtimes();
# Expected output:
# But found:
rec( system_time := 307, system_time_children := 0, user_time := 2523, 
  user_time_children := 0 )
########

@stevelinton
Copy link
Contributor Author

That’s 8ms of system time, which is 8000 microseconds, so the alarm might go off if that is being counted.

Steve

On 4 Nov 2015, at 16:41, Christopher Jefferson [email protected] wrote:

I tried injecting some 'Runtimes()' calls into the test to see where the time is being spent, it looks like there isn't any great amount of time being spent anywhere:

########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou
t.tst, line 18:

Input is:

Runtimes();

Expected output:

But found:

rec( system_time := 299, system_time_children := 0, user_time := 2523,
user_time_children := 0 )
########
########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou
t.tst, line 19:

Input is:

CallWithTimeout(5000,spinFor,1,1);

Expected output:

[ 1 ]

But found:

fail
########
########> Diff in /home/travis/build/ChrisJefferson/gap/tst/testinstall/timeou
t.tst, line 21:

Input is:

Runtimes();

Expected output:

But found:

rec( system_time := 307, system_time_children := 0, user_time := 2523,
user_time_children := 0 )
########


Reply to this email directly or view it on GitHub.

@ChrisJefferson
Copy link
Contributor

Woops, I got my units wrong, you are right.

So just change spinFor to something like this?

gap> spinFor := function(ms, arg) local t; t := Runtime();
Syntax warning: New syntax used -- intentional? in stream line 1
spinFor := function(ms, arg) local t; t := Runtime();
                                      ^
> t := Runtimes().user_time + Runtimes().system_time;
> while Runtimes().user_time + Runtimes().system_time < t + ms do od;
> if Length(arg) >= 1
> then return arg[1]; else return; fi; end;
function( ms, arg ) ... end

@stevelinton stevelinton mentioned this pull request Nov 5, 2015
@rbehrends
Copy link
Contributor

I am wondering whether it would be a good idea to add SA_RESTART to the signal flags, since code that uses CallWithTimeOut() may not be prepared to deal with errors resulting from an interrupted syscall.

I also think there is a potential race condition (including in the existing code for handling SIGINT). Remember that the compiler is unaware of signals and can reorder assignments to RealExecStatCopied. This can be resolved (I think, haven't fully verified it) by making RealExecStatCopied a volatile sig_atomic_t variable (this may require testing for the availability of sig_atomic_t and falling back to int or char instead where unavailable).

@stevelinton
Copy link
Contributor Author

Thanks.
While you’re here, am I right in thinking that the timer_create based implementation should be fairly easily adaptable to HPCGAP? I was thinking you could create one timer per thread, with the options that count thread user + system CPU and the signals ought to be delivered correctly. The setitimer version doesn’t look so adaptable, which is annoying for OSX users.

Steve

On 5 Nov 2015, at 17:39, Reimer Behrends [email protected] wrote:

I am wondering whether it would be a good idea to add SA_RESTART to the signal flags, since code that uses CallWithTimeOut() may not be prepared to deal with errors resulting from an interrupted syscall.

I also think there is a potential race condition (including in the existing code for handling SIGINT). Remember that the compiler is unaware of signals and can reorder assignments to RealExecStatCopied. This can be resolved (I think, haven't fully verified it) by making RealExecStatCopied a volatile sig_atomic_t variable (this may require testing for the availability of sig_atomic_t and falling back to int or char instead where unavailable).


Reply to this email directly or view it on GitHub.

@rbehrends
Copy link
Contributor

Setting up signal delivery and the timing mechanism should work the same way as in the single-threaded version (assuming that per-thread CPU accounting is actually correct and not faked [1]); however, the actual signal delivery and handling is complicated when you have multiple threads (note that this is also the case, if, say, a GAP package starts a new thread [2]). The underlying problem is that POSIX doesn't guarantee that a signal is delivered to a specific thread, but allows the OS to pick any thread (even randomly) that doesn't block the signal.

The usual solution is to block the signal in all but a dedicated signal handler thread and have that signal be forwarded by that thread to the thread it was intended for via pthread_kill() or a dedicated interrupt delivery mechanism. As it happens, HPC-GAP already has both a signal handler thread and an interrupt delivery mechanism (though pthread_kill() should also work for this use case and result in a smaller diff). Look for SIGWAIT (and attendant mechanisms) in lib/thread.g and src/threadapi.c; InterruptThread() in src/thread.c is a general means to deliver a GAP-level interrupt to another thread (it's how SIGINT is currently handled).

One final caveat: The current implementation of CALL_WITH_TIMEOUT() seems to jump past CALL_WITH_CATCH() setjmps (correct me if I'm wrong). This may be intended, but in HPC-GAP, error handling there also pops entries off the lock stack. A longjmp would therefore have to copy the lock stack handling semantics (trivial, but it has to be done).

[1] This can be simulated, in the same way that profilers do sample-based profiling, but that may become hairy.

[2] I.e. if a GAP package did this, it would have to block signals that the main thread is prepared to handle; this is not a problem in HPC-GAP, which blocks the signals it handles itself in the main thread and worker threads, and new threads inherit the signal mask of the creating thread.

@stevelinton
Copy link
Contributor Author

OK Thanks for the information.

@olexandr-konovalov
Copy link
Member

@stevelinton tested on Windows VM, compiling and running GAP in Cygwin shell. It builds without warnings and errors, but then timeout.tst throws a bunch of Error, Could not set interval timer errors like here:

########> Diff in /home/jenkins/gap/tst/testinstall/timeout.tst, line 14:
# Input is:
CallWithTimeout(5000,spinFor,1);
# Expected output:
[  ]
# But found:
Error, Could not set interval timer
########

@stevelinton
Copy link
Contributor Author

@alex-konovalov Great. So cygwin provides either setiimer or timer-create, but then it doesn't work.

Can you get the compile log from that system?

Even better, can you try to generate this error interactively and tell me what LastSystemError shows?

Thanks

@ChrisJefferson
Copy link
Contributor

Thanks @alex-konovalov , quick google reveals cygwin doesn't do ITIMER_VIRTUAL, we'll have to use ITIMER_REAL on cygwin (which does measure wall time rather than process time, but should be a reasonable substitute)

@olexandr-konovalov
Copy link
Member

Well, just replacing ITIMER_VIRTUAL by ITIMER_REAL in two places where it's used doesn't help...

@ChrisJefferson
Copy link
Contributor

Still error in the same place?

@olexandr-konovalov
Copy link
Member

@ChrisJefferson: yes :( I gave you the wrong fragment of code. It happens here:

  if (timer_settime(syTimer, 0, &tv, NULL)) {
    signal(TIMER_SIGNAL, SIG_DFL);
    ErrorReturnVoid("Could not set interval timer", 0L, 0L, "you can return to ignore");
  }
  • this is the version being used when #if HAVE_TIMER_CREATE && HAVE_SIGACTION

@olexandr-konovalov
Copy link
Member

@stevelinton The No such file or directory. seems to be meaningful. I've set the breakpoint at timer_settime and that's what I observe:

gap> CallWithTimeout(5000,spinFor,1);

Breakpoint 1, timer_settime (timerid=0, flags=0, value=0x28c0a0, ovalue=0x0)
    at /usr/src/debug/cygwin-1.7.35-1/winsup/cygwin/timer.cc:338
338     {
(gdb) n
341       __try
(gdb) n
338     {
(gdb) n
341       __try
(gdb) n
344           if (tt->magic != TT_MAGIC)
(gdb) n
341       __try
(gdb) n
338     {
(gdb) n
341       __try
(gdb) n
__sjfault () at sigfe.s:233
233     sigfe.s: No such file or directory.

@stevelinton
Copy link
Contributor Author

They must be using it to mean “no such timer”. I would guess that the timer-create call in InitAlarms, the return value of which, mea culpa, is not being checked, must be failing.

I imagine it is more-or-less the same as what Chris reported — CPU timing isn’t supported, so we have to use CLOCK_REALTIME instead of one of the others.

Steve

On 6 Nov 2015, at 21:12, Alexander Konovalov [email protected] wrote:

@stevelinton The No such file or directory. seems to be meaningful. I've set the breakpoint at timer_settime and that's what I observe:

gap> CallWithTimeout(5000,spinFor,1);

Breakpoint 1, timer_settime (timerid=0, flags=0, value=0x28c0a0, ovalue=0x0)
at /usr/src/debug/cygwin-1.7.35-1/winsup/cygwin/timer.cc:338
338 {
(gdb) n
341 __try
(gdb) n
338 {
(gdb) n
341 __try
(gdb) n
344 if (tt->magic != TT_MAGIC)
(gdb) n
341 __try
(gdb) n
338 {
(gdb) n
341 __try
(gdb) n
__sjfault () at sigfe.s:233
233 sigfe.s: No such file or directory.


Reply to this email directly or view it on GitHub.

@ChrisJefferson
Copy link
Contributor

My windows install is currently broken -- once I fix it, I'll figure out exactly what does and doesn't work in cygwin.

@stevelinton stevelinton force-pushed the sl/alarm branch 3 times, most recently from ba07060 to b8fdb15 Compare November 13, 2015 11:47
@stevelinton
Copy link
Contributor Author

The current version should in theory work on cygwin.If it does, then as far as I know it's ready to merge, although it might be sensible to squash the commits a bit.

@ChrisJefferson
Copy link
Contributor

Yep, works fine with cygwin

@olexandr-konovalov
Copy link
Member

@stevelinton @ChrisJefferson thanks - that was the only issue remaining, so let's merge this then.

Modify test file to hopefully please travis

Add warning to documentation that a timeout may leave things inconsistent.

Some robsutness improvements suggested by Reimer

Check return value from timer_create
Make alarms work with TakeInterrupt()

This allows long, but cooperative kernel functions to be interrupted.
olexandr-konovalov pushed a commit that referenced this pull request Nov 17, 2015
Add CallWithTimeout and related.
@olexandr-konovalov olexandr-konovalov merged commit 3ab25a4 into gap-system:master Nov 17, 2015
@olexandr-konovalov
Copy link
Member

No further comments for 4 days - merged now. Thanks

@olexandr-konovalov
Copy link
Member

@stevelinton interestingly, the test passes on Travis infrastructure, but fails on all three machines that we use for Jenkins nightly tests:

########> Diff in /mnt/disk2/hudson-slave/workspace/GAP-dev/GAPCOPTS/64build/G\
APGMP/gmp/GAPTARGET/install/label/64bit/GAP-git-snapshot/tst/testinstall/timeo\
ut.tst, line 14:
# Input is:
CallWithTimeout(5000,spinFor,1);
# Expected output:
[  ]
# But found:
fail
########
########> Diff in /mnt/disk2/hudson-slave/workspace/GAP-dev/GAPCOPTS/64build/G\
APGMP/gmp/GAPTARGET/install/label/64bit/GAP-git-snapshot/tst/testinstall/timeo\
ut.tst, line 18:
# Input is:
CallWithTimeout(5000,spinFor,1,1);
# Expected output:
[ 1 ]
# But found:
fail
########
########> Diff in /mnt/disk2/hudson-slave/workspace/GAP-dev/GAPCOPTS/64build/G\
APGMP/gmp/GAPTARGET/install/label/64bit/GAP-git-snapshot/tst/testinstall/timeo\
ut.tst, line 20:
# Input is:
CallWithTimeoutList(5000,spinFor,[1,1]);
# Expected output:
[ 1 ]
# But found:
fail
########
########> Diff in /mnt/disk2/hudson-slave/workspace/GAP-dev/GAPCOPTS/64build/G\
APGMP/gmp/GAPTARGET/install/label/64bit/GAP-git-snapshot/tst/testinstall/timeo\
ut.tst, line 24:
# Input is:
CallWithTimeoutList(5000,spinFor,[1]);
# Expected output:
[  ]
# But found:
fail
########

@olexandr-konovalov olexandr-konovalov added this to the GAP 4.8.1 milestone Nov 20, 2015
@olexandr-konovalov
Copy link
Member

@stevelinton could it be a race condition somewhere? The longer is the timeout, the more robust it is. For example, I have

gap> CallWithTimeout(50000,spinFor,1);
[  ]

each time, with the 1st argument being 10000 I can get fail time to time, and with the 1st argument being only 1000 I got [ ] one time out of twenty.

@rbehrends
Copy link
Contributor

Most likely, it's just the timeout being hit. As I understand it, the value is given in microseconds, and 5000 µs = 5 ms can easily fail if the machine is just a bit busy (or if the OS doesn't support high resolution timers).

@olexandr-konovalov
Copy link
Member

@rbehrends I think you're right... so I suggest to increase the timeout in the failing calls from 5000 to 50000 to make the test reproducible.

@coveralls
Copy link

Coverage Status

Changes Unknown when pulling 3ac7ce3 on stevelinton:sl/alarm into ** on gap-system:master**.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants